The Immunoglobulin Fold
EMBL, Meyerhofstraße 1, D-69012 Heidelberg, Germany

Since the first crystal structure of an immunoglobulin revealed a modular architecture, the
characteristic β-sheet fold of the immunoglobulin domain has been found in many other
proteins of diverse biological function. Here, a systematic comparison of 23 Ig domain
structures with less than 25% pairwise residue identity was performed using automatic
structural alignment and analysis of β-sheet and loop topology. Sequence consensus patterns
were identified for nine distinct families with at most marginal similarity to each other. The
analysis reveals a common structural core of only four β-strands (b, c, e and f), embedded in
an antiparallel curled β-sheet sandwich with a total of three to five additional strands (a, c',
c", d, g) and a characteristic intersheet angle. The variation in the position of the edge strands
(a, c', c", d and g) relative to the common core defines four different topological subtypes that
correlate with the length of the intervening sequence between strands c and e, the most variable
region in sequence. The switch of strand c' from one sheet to the other in seven-stranded
domains appears to result from short c'-e segments, rather than being a major structural
discriminator. The high degree of structural flexibility outside the common core and the
extreme variability of side-chain packing inside the core do not support a protein folding
pathway common to all members of the structural class. Mutation rates of immunoglobulin-like
domains in different proteins vary considerably. Disulfide bridges, thought to contribute to
structural stability, are not necessarily invariant in number and location within a subclass.

1. Introduction

The exponential growth of sequence databases and
the drastic increase in published tertiary structures
have revealed an increasing number of protein
families with similar structure but with extremely
divergent sequences (reviewed by Holm & Sander,
1994). One of the most striking examples is the
emerging class of immunoglobulin-like (Ig-like)
domains. The folding topology of immunoglobulin
constant and variable domains has been described
as a Greek key β-barrel, subclass "simple"
(Richardson, 1981). The structural domains of
immunoglobulin have seven to nine antiparallel
β-strands forming a barrel-like shape. However,
hydrogen bonds do not go around the barrel, so
that there are two distinct β-pleated sheets and
physically the fold is a β-sandwich (Lesk & Chothia,
1982). The class of simple Greek key proteins includes
a number of other proteins, e.g. superoxide dismutase,
blue copper proteins and hemocyanin, but they have
additional elements of secondary structure. This
review aims at a classification of structures of
immunoglobulin and non-immunoglobulin domains
which nevertheless have the same topology as
immunoglobulins, i.e. the same order and number of
strands.



Recently, Ig-like domains have been reported in
numerous structures of non-immunoglobulins,
including (1) cell surface receptors such as CD2
(Driscoll et al., 1991; Jones et al., 1992), CD4 (Ryu
et al., 1990; Wang et al., 1990; Garrett et al., 1993;
Brady et al., 1993), CD8 (Leahy et al., 1992a),
MHC/HLA (Saper et al., 1991 and references
therein), growth hormone receptor (de Vos et al.,
1992) and neuroglian (Huber et al., 1994), (2) matrix
proteins such as tenascin (Leahy et al., 1992b) and
fibronectin (Main et al., 1992), (3) intracellular
regulatory proteins such as the bacterial chaperonin
PapD (Holmgren & Branden, 1989), as well as (4)
enzymes such as cyclodextrin glycosyltransferase
(Klein & Schulz, 1991), myosin light chain kinase
(telokin; Holden et al., 1992) and galactose oxidase
(Ito et al., 1991). These proteins have diverse
functions. Based on distant sequence similarity, some
of the domains were predicted to belong to the
immunoglobulin superfamily in advance of structure
determination (e.g. type III repeat of fibronectin
(Fn3†; Bazan, 1990)); in other cases the structural
similarity was quite unexpected (e.g. PapD;
Holmgren & Branden, 1989). All Ig-like domains
appear to be involved in binding functions; not a
single one is known to contain a natural enzyme
active site. Considering the vast number of sequence
relatives, the Ig-like domain is probably the most
widespread protein module, at least in animals
(Doolittle & Bork, 1993). Based on sequence
information, 40% of the human leucocyte surface
proteins were predicted to contain Ig-like domains
(Barclay et al., 1992); this number might even
increase considering the various sequence-unrelated
proteins and their families reviewed here.

† Abbreviations used: Fn3, fibronectin type III; PDB,
protein data bank; r.m.s., root-mean-square; GHR,
growth hormone receptor.

Faced with the increasing functional, structural
and sequence diversity of Ig-like domains, one
wonders whether there are any conserved features
common to all Ig-like domains. If there are, do they
indicate a common folding principle or common
folding pathway? Is it possible to discriminate
divergent evolution within this family from
convergence towards an energetically stable fold?
What is the correlation between sequence similarity
and structural similarity in marginally related
sequences? To provide a basis for answering these
questions we have undertaken an objective
(automatic) multiple superimposition of all these
domains and correlated the common and distinct
features with functional and sequence information.



Table 1

Classification of structures used in the analysis

Code     Protein, domain                                           Species                   Topology    Res. (Å)  s-s (deg)  aa c*-e  Reference
2CD4-2   CD4 domain 2                                              Homo sapiens              s-type      2.3       153        ?        Ryu et al. (1990)
1CID-4   CD4 domain 4                                              Rattus rattus             s-type      2.8       154        9        Brady et al. (1993)
CD2-2    CD2 domain 2                                              Rattus rattus             s-type      2.8       160        9        Jones et al. (1992)
FN3      Fibronectin                                               Homo sapiens              s-type      NMR       150        9        Main et al. (1992)
GN-1     Neuroglian, domain 1                                      Drosophila melanogaster   s-type      2.0       152        9        Huber et al. (1994)
GN-2     Neuroglian, domain 2                                      Drosophila melanogaster   s-type      2.0       141        10       Huber et al. (1994)
1GOF     Fungal galactose oxidase, domain 3, res. 533-639          Dactylium dendroides      h-type      1.7       150        12       Ito et al. (1991)
1CGT     Cyclodextrin glycosyltransferase, domain D, res. 495-580  Bacillus circulans        h-type      2.0       n.d.‡      13       Klein & Schulz (1991)
2HHR-1   Growth hormone receptor, domain 1                         Homo sapiens              s-type      2.8       159        16       de Vos et al. (1992)
1TLK     Telokin                                                   Meleagris gallopavo       c/v-type†   2.8       160        17       Holden et al. (1992)
3DPA-1   PapD protein, domain 1, res. 1-120                        Escherichia coli          s-type      2.5       141        17       Holmgren & Branden (1989)
3FAB-VL  IgG2a-kappa, variable domain of light chain               Mus musculus              v-type§     2.0       154        18       Herron et al. (1989)
3FAB-CH  IgG2a-kappa, constant domain of heavy chain               Mus musculus              c-type      2.0       162        20       Herron et al. (1989)
CEL      Cellulase CelD                                            Clostridium thermocellum  h-type      2.3       141        21       Juy et al. (1992)
1FC1-C2  Constant domain 2                                         Homo sapiens              c/v-type†   2.9       ?          ?        Deisenhofer (1981)
1FC1-C3  Constant domain 3                                         Homo sapiens              c-type      2.9       157        21       Deisenhofer (1981)
3HLA     Class I histocompatibility antigen A2.1                   Homo sapiens              c-type      2.6       165        21       Saper et al. (1991)
2HLA     Class I histocompatibility antigen Aw68.1                 Homo sapiens              c-type      2.6       165        22       Garrett et al. (1989)
CD2-1    CD2 domain 1                                              Rattus rattus             v-type§     2.8       168        23       Jones et al. (1992)
2RHE-VL  Bence-Jones Ig variable domain                            Homo sapiens              v-type§     1.6       152        25       Furey et al. (1983)
1CD8-1   CD8 domain 1                                              Homo sapiens              v-type      2.6       150        20       Leahy et al. (1992a)
1CID-3   CD4 domain 3                                              Rattus rattus             v-type§     2.8       157        29       Brady et al. (1993)
2CD4-1   CD4 domain 1                                              Homo sapiens              v-type§     2.3       151        32       Ryu et al. (1990)
4FAB-VH  IgG2a-kappa, variable domain of heavy chain               Homo sapiens              v-type      2.0       145        34       Herron et al. (1989)

† Telokin has 8 strands and is structurally similar to v-type but lacks c". Thus, it appears to be a hybrid between c-type and v-type.
The constant domain 2 of 1FC1 is most similar to c-type structures but has a very short strand topologically equivalent to c'.
‡ Not determined due to a β bulge involving the residues used to calculate the sheet-sheet angle.
§ Strand a switched over to sheet g-f-c-c'-c".

Proteins studied are identified by their PDB code, if available, followed by a hyphen and domain mnemonic. The co-ordinates for CD2,
FN3, neuroglian and cellulase CelD were kindly provided by the authors. The structures are ordered according to their loop lengths between
the reference point c* and e. Note the correlation between the loop length and the structural subtype. There are 4 loops between the 5
segments of the structural core (b-c, c-c*, c*-e, e-f). The number of residues in the c*-e segment is given in column (aa c*-e). Sheet-sheet
angles (column s-s deg) were defined as the dihedral angle between vectors based at the midpoint of strands b and e and of strands c and
f, counting the residues of the common core only. The orientation of the vectors was the sum of all CA(i) to CA(i+1) vectors for strand b and
CA(i) to CA(i-1) vectors for strand e, and similarly for strands c and f.
Figure 1. 2D topology diagrams of observed hydrogen bonding patterns. The 7-9 strands (a, b, c, c', c", d, e, f, g) form
a sandwich of 2 sheets (back sheet, I, thin arrows; front sheet, II, thick arrows, where back and front refer to Figure 3;
packing not evident and loop lengths not to scale in this projection). The common core (Figure 3) is shown in red.
Immunoglobulin constant domains have 7 strands in a topology shown at the bottom left (c-type, for constant).
Immunoglobulin variable domains have an additional hairpin (c'-c") between strands c and d, with a total of 9 strands (v-type,
for variable). Strand a has 2 alternative locations in v-type domains, being part of either the back sheet (antiparallel pairing
with strand b) or of the front sheet (parallel pairing with strand g). Other Ig-like domains also have 7 strands, but are different
from c-type in that the 4th strand has switched sheets (s-type, for switched); the name of the 4th strand changes from d
to c' to reflect the sheet switch. The last type represents an 8-stranded hybrid between c- and s-type that has both strands
c' (front sheet) and its direct continuation strand d (back sheet), so that both sheets have 4 strands (h-type, for hybrid).



2. Structure Comparison 

(a) Search for 3D structures closely related to Igs 

Structure comparisons were carried out using the
program Dali, which maximizes a geometrical
similarity score calculated from intramolecular
distances in the common core (details in Holm &
Sander, 1993). Structures similar to immunoglobulin
domains were identified from an all-against-all
comparison of a representative set of protein
structures (Holm & Sander, 1993). This set was
extended by Ig-like structures not yet in the Protein
Data Bank (PDB; Bernstein et al., 1977). In addition
to immunoglobulins, the selected set includes domains
from HLA, CD2, CD4, CD8, fibronectin, tenascin,
neuroglian, the growth hormone receptor, bacterial
domains from a thermostable cellulase, cyclodextrin
glycosyltransferase, PapD and a fungal galactose
oxidase (Table 1). Only three pairs in this set have
sequence identities at or above 25% (FN3 with 1TEN,
3FAB heavy constant domain with 1FC1 third
domain, and the 2RHE and 3FAB light chain variable
domains). In spite of the striking similarities between
the first domain of PapD and Ig-like domains, the
second domain of PapD is a complete outlier. It has
very low structure similarity when compared with all
other domains in the Ig-like set, and was therefore
excluded. Other proteins structurally most similar to
the Ig-like set include blue copper proteins,
actinoxanthin and superoxide dismutase. However,
these were found to be topologically distinct
(additional strands inserted between a and b in several
blue copper proteins; an extra strand before strand a
in superoxide dismutase) and geometrically different
(poorer superimposition, different twist and tilt
angles between the sheets) and were therefore
excluded from the detailed comparisons.
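
To make the idea of a similarity score "calculated from intramolecular distances" concrete, the sketch below scores one fixed set of residue equivalences by comparing the CA-CA distance matrices of two structures. It is a hedged illustration only: the functional form and the constants (a similarity threshold of 0.2 and a 20 Å distance envelope) are assumptions chosen to resemble a published elastic distance-matrix score, the function name is hypothetical, and the search over alternative alignments that a program such as Dali performs is not shown.

```python
import math

def elastic_similarity(coords_a, coords_b, theta=0.2, alpha=20.0):
    """Score a set of equivalenced CA positions from two structures.

    coords_a[i] and coords_b[i] are the coordinates of the i-th pair of
    structurally equivalent residues. Each residue pair (i, j) contributes
    more when the intramolecular distances d_a(i, j) and d_b(i, j) agree,
    with long-range pairs down-weighted by a Gaussian envelope.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("equivalenced coordinate lists must match in length")
    n = len(coords_a)
    score = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                score += theta          # diagonal term: a matched residue
                continue
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            mean = 0.5 * (d_a + d_b)
            deviation = abs(d_a - d_b) / mean   # relative distance change
            score += (theta - deviation) * math.exp(-(mean / alpha) ** 2)
    return score

# Synthetic coordinates: an ideal three-residue stretch and a distorted copy.
ideal = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
score_same = elastic_similarity(ideal, ideal)
score_diff = elastic_similarity(ideal, stretched)
```

Because only intramolecular distances enter the score, no superposition of the two structures is needed before comparing them.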

(b) Definition of topological subtypes 

Current classification schemes of classical Ig-like
domains are mainly based on the number of strands
and sequence similarity (division into v, c1 and c2
sequence "sets"; Williams & Barclay, 1988; Hsu &
Steiner, 1992). However, with the crystal structures of
PapD and CD4 (Holmgren & Branden, 1989; Ryu
et al., 1990; Wang et al., 1990) it became obvious that
the d-strand of the seven-stranded structures can
switch between the two sheets (then called c'; Figure
1) and many authors distinguish between the two
forms. Backbone hydrogen bonding patterns (Kabsch
& Sander, 1983) in the Ig-like domains indeed define
two sheets rather than a closed barrel. The edges of
the sheets are conformationally flexible (in particular,
strands a, g, c', c"). In a number of cases a strand starts
in one sheet, then the chain bulges a bit but continues
in the same direction, ending as a strand in the second
sheet. Most recently, comparison of telokin (Holden
et al., 1992) with other immunoglobulin-like domains
such as CD4 and CD8 led to the proposal of a distinct
"I-set" which can be detected at the sequence level by
profile searches (Harpaz & Chothia, 1994).

Based on the number of strands and the location of
strand c'/d we can define at least four distinct
subtypes (Figure 1) that are present in our dataset of
superimposed immunoglobulin-like domains: (1)
c-type: classical seven-stranded topology of the
constant domains in immunoglobulins (sheet I:
d-e-b-a; sheet II: g-f-c); (2) s-type: seven-stranded
strand-switched type (sheet I: e-b-a; sheet II: g-f-c-c');
(3) h-type: a hybrid between (1) and (2), where strand
c'/d is kinked and the N-terminal residues of this
segment form hydrogen bonds with strand c (sheet II)
whereas the C-terminal residues belong to sheet I; and
(4) v-type: a nine-stranded type as occurring in the
variable domains of immunoglobulins (sheet I:
d-e-b-a; sheet II: g-f-c-c'-c"). However, the location of
strand a within sheet I or II varies and in the future
a further classification may be applied to distinguish
between the forms with sheets composed of 4 + 5 or
3 + 6 strands. The length of strands c' and c" varies.
In some cases, strand c" is just a loop and is beyond
the threshold for detection by standard programs for
secondary structure definition such as DSSP (Kabsch
& Sander, 1983).

Although we found no example in our dataset, in
principle a switch of strand a in the seven-stranded
sandwich (similar to that seen in nine-stranded
sandwiches) should also be possible. Thus, we
expect a variety of additional subtypes for which 3D
structures are not yet available.

Figure 2. Clusters in structure space. A multivariate
statistical analysis method called correspondence analysis
(Hill, 1973) was used to represent all pairwise similarities
between Ig-like domains in 2 dimensions. The distances in
the plane approximately indicate structural dissimilarity.
Adjacent points are most related in 3D structure.
Structures are labeled as in Table 1; separate plot symbols
mark s-type, v-type, c-type and h-type domains. The
clustering into subtypes and the resulting classification
into v-, c- and s-type is a straightforward consequence of
the structural alignment method, without predefined
notions of topological types. The 1st eigenvector
(horizontal) discriminates between the 9-stranded v-type
and the 7-stranded domains. The 2nd eigenvector
(vertical) discriminates between the 7-stranded c-type and
the 7-stranded strand-switched s-type. In a further
dimension, all domains from enzymes (galactose oxidase,
1GOF; cyclodextrin glycosyltransferase, 1CGT; and
cellulase, CEL) form a separate group of h-type domains.
Some subfamilies become visible such as CD4 domain 4
with CD2 domain 2, which form a distinct subgroup of
s-type domains.

(c) Correlation between topological subtype and
structural similarity

Multivariate analysis of the pairwise structural
similarity scores (Figure 2) reveals three or four main
clusters although the proteins represent a larger
number of sequence families (see below). The clusters
in structure space correspond to the topological
classes. Projection onto the first two eigenvectors
separates clusters of c, v and s(h)-type domains
(Figure 2). The third eigenvector separates the first
domain of PapD as an outlier; the fourth eigenvector
separates the Ig-like domains surrounding the
catalytic domains of phylogenetically old enzymes
(i.e. in cellulase, cyclodextrin glycosyltransferase and
galactose oxidase) from the s-type cluster (data not
shown). Closer inspection of the clusters reveals some
interesting classifications: domain 2 of the human
growth hormone receptor, not only domain 1, is pulled
towards the s-type Fn3 cloud; the T-cell receptor
domains from CD2 and CD4 are surprisingly similar;
and CD8 clusters among v-type domains (Figure 2).
Telokin, an eight-stranded structure, clearly clusters
within the v-type family although it lacks the c"
strand. Average linkage clustering reveals a relatively
close resemblance between the groups of the
seven-stranded c-type and the nine-stranded v-type
domains (not shown). These observations indicate
that the transformation between seven- and
nine-stranded topology only involves a local
perturbation of structure.
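
Average linkage clustering, as mentioned above, repeatedly merges the two groups whose members have the smallest mean pairwise dissimilarity. The following minimal sketch shows that rule on a toy matrix. The labels and dissimilarity values are invented for illustration (two "c-like", two "v-like" and one "s-like" item) and are not the structural similarity scores of this study; the function name is likewise hypothetical.

```python
from itertools import combinations

def average_linkage_clusters(labels, dist, n_clusters):
    """Agglomerative clustering with average linkage.

    labels: item names; dist: symmetric matrix of pairwise dissimilarities
    indexed like labels; n_clusters: number of groups to stop at.
    """
    clusters = [[i] for i in range(len(labels))]

    def linkage(a, b):
        # Mean dissimilarity over all cross-cluster item pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(labels[i] for i in c) for c in clusters)

# Toy dissimilarities (assumed values): c- and v-like items are closer to
# each other (3) than either is to the s-like item (6).
toy_labels = ["c1", "c2", "v1", "v2", "s1"]
toy_dist = [
    [0, 1, 3, 3, 6],
    [1, 0, 3, 3, 6],
    [3, 3, 0, 1, 6],
    [3, 3, 1, 0, 6],
    [6, 6, 6, 6, 0],
]
three_groups = average_linkage_clusters(toy_labels, toy_dist, 3)
two_groups = average_linkage_clusters(toy_labels, toy_dist, 2)
```

In the toy data the c-like and v-like groups merge before either joins the s-like item, which mirrors the kind of statement made above about c-type and v-type domains.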

(d) Common structural core

We have emphasized the structural flexibility and
the presence of structural subtypes in the set of
Ig-like domains under study. The structural
alignments allow the definition of a common
structural core as those residues which are aligned
against a reference structure (e.g. the variable domain
of Rhe) in all structures. The common core contained
strands b, c, e and f plus a piece of strand c' or a piece
of the c-d loop in classes that lack strand c', a total of
31 residues (Figure 3). The structural core does not
contain the conformationally flexible (between
families) edge strands. The reason for this is that the
non-central strands can be shifted structurally
between different pairs of domains. The lengths of the
strands and surrounding loop regions vary extremely
(Table 1).

Figure 3. The common core of Ig-like domains. (a) The 2-plus-2 stranded structural core (strands b, c, e, f) common to
all Ig-like domains (red) is surrounded by structurally more variable strands (green). The front sheet has up to 5 strands
(g-f-c-c'-c"), the back sheet up to 4 (a-b-e-d). Strand c" is very flexible in Ig variable domains and does not always form β-strand
type hydrogen bonds. The common core was defined using a multiple structural alignment generated by the program
Dali (details in Holm & Sander, 1993). Ribbon by Molscript (Kraulis, 1991). Selected examples from different subtypes
are shown in (b). Note that strand definitions according to DSSP do not always correspond to structurally equivalent
positions.
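
The core definition used here, residues of a reference structure that have an aligned partner in every other structure, amounts to a set intersection. The sketch below assumes each pairwise structural alignment is available as a mapping from reference residue index to the equivalent residue index in the other domain; the domain names and index values are hypothetical.

```python
def common_core(reference_length, alignments):
    """Return the reference positions aligned in all structures.

    alignments: dict mapping a structure name to a dict
    {reference residue index: equivalent residue index in that structure}.
    Reference positions absent from a mapping are unaligned in that pair.
    """
    core = set(range(reference_length))
    for mapping in alignments.values():
        core &= set(mapping)
    return sorted(core)

# Hypothetical alignments of two domains against a 6-residue reference.
example_alignments = {
    "domain_A": {0: 5, 1: 6, 2: 7, 4: 9},
    "domain_B": {1: 1, 2: 2, 3: 3, 4: 4},
}
core_positions = common_core(6, example_alignments)
```

Adding a further structure can only shrink the core, which is consistent with the small size (31 residues) reported for the full set.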



(e) Correlation between loop length and sheet 
switching 

The polypeptide chain segment between strands c
and e has to change direction by at least 180 degrees
and cross over to the other sheet. There is a correlation
between the number of residues between strands c and
e and the topological subtype. Our structural
comparison revealed a reference point at the
beginning of strand c' or in the c-d loop, respectively
(Figure 1, Table 1), that could be superimposed in all
pairs (short red loop segment preceding strand c' in
Figure 3). The following observations consider the
sequences between this reference point, hereafter
called c*, and the beginning of strand e.

In the structures of all Fn3 domains as well as the
second domains of CD2, CD4 and the first domain of
PapD, strand c' is hydrogen bonded to sheet II (thick
arrows in Figure 1, see also Figure 3). These structures
(s-type) have only seven to ten residues between c*
and e, compared to typically well over 20 residues in
c-type domains. The hybrid-type domains
(intermediates between s- and c-type) of galactose
oxidase and cyclodextrin glycosyltransferase have
12-13 residues between c* and e. The topologies of
c-type and v-type are very similar if the two additional
strands c' and c" of v-types are treated as a long
insertion between c and d that forms a stable
β-hairpin. Following this idea, telokin (Holden et al.,
1992) is an intermediate between c- and v-type: 16
residues between c* and e are not sufficient for a c"
strand. Thus we see a tendency for the segment
between c* and e to prefer backbone-hydrogen-bonded
structure over coil, folding into a β-strand or hairpin
which is appended to the sheet nearest at hand. The
correlation is illustrated in Table 1: the length of the
intervening sequence between c* and the e strand
increases from s- to h- to c- to v-type structures.
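
The trend stated above can be checked directly against the loop lengths listed in Table 1. The short script below groups the c*-e lengths by subtype and takes the median of each group. It is a sketch with two stated restrictions: only rows whose loop length and topology are unambiguous in Table 1 are included, and the two c/v-type entries (telokin and constant domain 2 of 1FC1) are left out because they belong to no single class.

```python
from statistics import median

# (code, subtype, residues between c* and e), taken from Table 1.
LOOP_LENGTHS = [
    ("1CID-4", "s", 9), ("GN-1", "s", 9), ("GN-2", "s", 10),
    ("2HHR-1", "s", 16), ("3DPA-1", "s", 17),
    ("1GOF", "h", 12), ("1CGT", "h", 13), ("CEL", "h", 21),
    ("3FAB-CH", "c", 20), ("3HLA", "c", 21), ("2HLA", "c", 22),
    ("3FAB-VL", "v", 18), ("CD2-1", "v", 23), ("2RHE-VL", "v", 25),
    ("1CID-3", "v", 29), ("2CD4-1", "v", 32), ("4FAB-VH", "v", 34),
]

def median_loop_length_by_subtype(rows):
    """Group c*-e loop lengths by topological subtype and take medians."""
    grouped = {}
    for _code, subtype, length in rows:
        grouped.setdefault(subtype, []).append(length)
    return {subtype: median(lengths) for subtype, lengths in grouped.items()}

medians = median_loop_length_by_subtype(LOOP_LENGTHS)
# The ordering s < h < c < v is the correlation described in the text.
ordered = [medians[k] for k in ("s", "h", "c", "v")]
```

The medians rise monotonically from s- to v-type, although individual domains (for example the s-type entries with 16 and 17 residues) overlap with the neighbouring class.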
3. Sequence Comparison

(a) Sequence database searches

The proteins of known 3D structure reviewed here
represent seven distinct sequence families
(considering the immunoglobulin subtypes as being
one family). Two sequence families which have been
predicted to contain the Ig-fold, namely interferon
receptors and CUB domains (Bazan, 1990; Bork &
Beckmann, 1993), were added although their
structures have not yet been solved. Figure 4 shows
the derived conserved features of each sequence
family. The number of sequences that are available in
current sequence databases varies from one family to
another (Figure 4). The most abundant families
include the domains of immunoglobulins. More than
1000 members of the Igc family are already stored in
current sequence databases. Nevertheless, the
consensus (determined at a 70% conservation level)
clearly shows conserved hydrophobic features
(Figure 4).

(b) Comparison of consensus sequences of the 
families 

The presence of a conserved structural core does
not imply a similar hydrophobicity pattern among
the different sequence families. Only strand f appears
to retain clear conserved hydrophobic features in all
sequence families. It remains unclear whether this
requirement is needed to initiate or stabilize folding
of the β-sandwich. Turn formation has been proposed
to be a key step in folding as an initiator of the zipping
up of ladder structures. Inspection of the loops
between the sequence families revealed not a single
loop conserved in length throughout the families that
could specifically initiate folding. Other features, such
as conserved residue properties in loop segments,
could not be detected either. Even within sequence
families loops are not necessarily conserved and can
vary in length and amino acid composition.

In most of the sequence families shown in Figure 4,
aromatic residues are conserved near the termini of
the core β-strands, although they do not correspond
to topologically equivalent strands. Nevertheless, the
presence of aromatic residues might be required to
form a stable hydrophobic core. Aromatic residues
have been proposed to play a special role in the design
of antiparallel β-sandwiches. Their large size and
specific geometry make them the best candidates to
fill some specific positions, and because of their low
conformational entropy they have been suggested to
possibly play a role as folding nuclei (Finkelstein &
Nakamura, 1993). Aromatic residues are strongly
conserved within families but not between families.
This conservation is particularly strong for families
that lack stabilizing disulfide bridges.



(c) Disulfide bridges 

One of the original hallmarks of an Ig-like fold was
a conserved disulfide bridge between strands b and f
(Lesk & Chothia, 1982). It is, however, not a necessary
determinant of the Ig-fold. Even within the Ig-like
sequence family these cysteine residues might be
completely absent (see Williams & Barclay, 1988; and
references therein) as, for example, shown for domain
1 of CD2 and domain 3 of CD4 (Jones et al., 1992;
Brady et al., 1993). In other domains, the disulfide
bridges have moved within the core. For example,
domain 2 of CD4 forms a disulfide bond between
strands c and f (Ryu et al., 1990; Wang et al., 1990)
instead of between strands b and f as in classical Ig
domains; the second domain of CD2 has, in addition
to the b-f bond, a disulfide bridge between strands a
and g (Jones et al., 1992).

The number and location of disulfide bridges also
vary in families without sequence similarity to Igs.
The first domain of growth hormone receptor
contains three disulfide bridges, other members of the
hemopoietic receptor family (C4 in Figure 4; Bazan,
1990) only two. Alignment of all these
sequence-related domains reveals a different location
of disulfide bridges within this family as well (data
not shown). Structure comparison placed the C4
family in the close neighborhood of another
widespread sequence family, the Fn3 repeats. The
majority of protein domains in this family apparently
do not need disulfide bridges at all to stabilize the
β-sandwich. Only for a domain in neuroglian has a
single unusual disulfide bridge connecting strands a
and g been demonstrated (Huber et al., 1994).
However, CD45, which has been predicted to contain
Fn3 domains (Bork & Doolittle, 1992), has several
cysteine residues in the respective regions and might
be a heavily disulfide-bonded example among Fn3
domains.

Figure 4. Comparison of distinct sequence families with Ig-like topology. The consensus sequence of each family is
represented by a string of symbols (capitals, conserved amino acids; lower case, h is hydrophobic; t is turn-like or polar;
a is aromatic; o is OH group = S/T; +/- is charged). Positions without any obvious consensus are indicated by a dot, .;
do not interpret dots as gaps. Gaps are denoted by underscore, _, surrounded by numbers that represent the variability
of loop lengths within 1 sequence family. In some families, the terminal strands are too variable for a consensus to be derived;
these parts have been omitted. The numbers at the right (occ, for occupancy) are estimates of the number of family members
in the current protein databases. Bullets above the sequences mark structurally equivalent core positions (residues in the
common core that make major intersheet contacts). Lower case letter strings below the sequences are strand labels, as in
Figures 1 and 3. Database searches were carried out using FASTA (Pearson & Lipman, 1988). The identified proteins or
domains were aligned and both profiles (Gribskov et al., 1987) and property patterns (Rohde & Bork, 1993) were constructed.
The procedure was conducted by iteratively adding newly identified members to the multiple alignment (for details, see Bork,
1993). An exception was made for the vast number of proteins sequence-related to immunoglobulins. Here, the classification
of Williams & Barclay (1988) was used and 4 subfamilies were defined: Igc1, Igc2, Igv and domains without disulfide bridges.
Even if constant and variable domains show sequence similarity to each other they can be well distinguished because of
their different number of amino acids between strands c and e. The searching scheme resulted in a distinction of 7 sequence
families of known structure which could not be fused by either sequence profile or property pattern methods. In addition
to families with known structure, the sequence analysis procedure was applied to interferon receptors and CUB domains,
for which an Ig-like topology has been proposed (Bazan, 1990; Bork & Beckmann, 1993). From the resulting alignment
the consensus lines were derived; amino acids or properties that are conserved in more than 70% of the sequences of a
particular family are displayed.

Table 2

Mutation rates of Ig-like domains in vertebrates

Domain                Code     Human versus bovine/pig  Human versus mouse/rat   Human versus chicken  Disulfide bridges
3rd in tenascin       1TEN     96%                      89%                      86%                   No
10th in fibronectin   FN3      ?                        86%                      83%                   No
GHR domain 2          2HHR-2   87%                      70%                      69%                   No
GHR domain 1          2HHR-1   83%                      69%                      62%                   Yes
Microglobulin         3HLA     76%                      71%                      47%                   Yes
CD4 domain 3          1CID-3   -                        59%                      -                     No
CD4 domain 4          1CID-4   -                        56%                      -                     Yes
CD2 domain 2          CD2-2    -                        56%                      -                     Yes
CD4 domain 1          2CD4-1   -                        52%                      -                     Yes
CD4 domain 2          2CD4-2   -                        47%                      -                     Yes
CD2 domain 1          CD2-1    -                        41%                      -                     No
CD8                   1CD8-1   66%                      41%                      -                     Yes

The domains of our dataset (for abbreviations see Table 1) were compared with their putative
orthologues from other vertebrates if available. For some of them quantitative comparisons were
omitted as too many paralogues exist (e.g. Igs or 2HLA). Human sequences are compared with rodents
(mouse, rat), artiodactyla (bovine/pig) and chicken. The sequence identities are averaged if sequences
from 2 species of the respective group were found. Fn3 domains appear to be more conserved than
other Ig-like domains.
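
The percentages in Table 2 are pairwise sequence identities between orthologous domains, averaged when two species of a group are available. A minimal sketch of that calculation is given below. It assumes the sequences are already aligned with "-" as the gap character and counts identity over positions where both sequences have a residue; the exact normalisation behind Table 2 is not stated in the text, so this convention is an assumption, and the sequences are invented.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical residues over aligned, ungapped positions."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def group_identity(human, orthologues):
    """Average identity of the human domain against one or two orthologues
    from the same species group (e.g. mouse and rat)."""
    values = [percent_identity(human, other) for other in orthologues]
    return sum(values) / len(values)

# Invented aligned fragments, for illustration only.
single = percent_identity("ACDE-F", "ACDQGF")
averaged = group_identity("ACDEF", ["ACDEF", "ACDQQ"])
```

In the first toy comparison one column is skipped because of the gap, leaving four identities out of five aligned positions.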



(d) Mutation rates 

Each family identified by sequence comparison has 
clusters of conserved, buried core residues, but these 
are different between the different sequence families. 
Because of this strong dispersion of the different 
families in sequence space, no statement of possible 
divergence between the families can be made. The rate 
of divergence within families can be quantified if 
orthologues of comparable species have been 
sequenced. The considerable differences in mutation 
rates between Tg-like domains in different proteins 
(Table 2) suggest that functional rather than struc- 
tural constraints are the dominant influence on the 
evolution of Ig-like domains. In the examples with 




Figure 5. Example of close structural similarity in spite of lack of sequence similarity. Superposition of Cα traces: 
galactose oxidase (bold, 1GOF) and cyclodextrin glycosyltransferase (thin, 1CGT). Sequence patterns of the corresponding 
families are shown in Figure 4. Structurally equivalent residues (r.m.s.d. of 1.5 Å over 70 Cα pairs): in 1GOF residues 543 
to 566, 569 to 574, 589 to 596, 600 to 606, 613 to 625, 627 to 638 and in 1CGT residues 497 to 520, 526 to 535, 539 to 542, 
546 to 552, 555 to 575, 577 to 580. 







[Figure 6 panel labels: PapD; Ig (Fab); GHR; CD8; insulin receptor] 



Figure 6. Diversity of binding sites among Ig-like domains. Schema of 5 different modes of binding interaction. 
Interactions with other Ig-like domains as well as the respective topological subtypes are indicated. Ligands, though 
different molecules in each case, are depicted as squares. 



known three-dimensional structures (Table 2), Fn3 
domains have slower mutation rates than domains 
with sequence similarity to Igs. However, highly 
conserved domains with sequence similarity to Igs are 
found in N-CAM neural adhesion molecules (N. 
Barclay, personal communication): they reach a 
sequence identity of about 93% between human and 
rodents (compare with Table 2). Most of the Ig-like 
domains have, however, a sequence similarity much 
lower than the average above 90% of all rodent- 
human protein pairs studied so far (Doolittle, 1987) 
and thus have faster mutation rates. 
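
The orthologue comparison described above can be illustrated with a minimal sketch. It assumes that the two sequences are already aligned, that identity is counted only over columns where neither sequence has a gap, and that the values for several species of one group are averaged as in Table 2. The text does not specify these details, so they are assumptions; the function names and the toy sequences are hypothetical and do not represent real domains.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over ungapped alignment columns.

    Assumption: columns with a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def mean_identity(reference, orthologues):
    """Average identity of a reference sequence to orthologues of one species group."""
    values = [percent_identity(reference, other) for other in orthologues]
    return sum(values) / len(values)


# Invented toy alignment (hypothetical, for illustration only).
human = "GSVTLKCEAS-GYTF"
mouse = "GSVKLKCEASGGYTF"
rat = "GSVTLRCEAS-GYSF"

rodent_identity = mean_identity(human, [mouse, rat])
```

A domain whose averaged identity falls well below the level typical of rodent-human protein pairs would, in this reading, count as fast evolving.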

4. Unexpectedly Close Structural Resemblances 

(a) Correlation between structure and sequence 
similarity 

It is well-known that three-dimensional structures 
are much more conserved in evolution than are 
sequences. The relationship between structure and 
sequence similarity is approximately monotonic, i.e. 
they both decrease in parallel at larger evolutionary 
distances, down to a threshold level that corresponds 
to about 25 to 30% identical residues (e.g. Lesk & 
Chothia, 1986; Doolittle, 1987; Sander & Schneider, 
1991; Hilbert et al., 1993). The correlation was verified 
for the set of Ig-like domains (dataset of Table 1 
augmented with additional immunoglobulin do- 
mains). Below 25 to 30% sequence identity, however, 
any correlation is smeared out (data not shown). This 
can be interpreted in several ways. Either structural 
dissimilarity, measured as positional r.m.s. deviation, 
levels off at larger evolutionary distances; or, sequence 
identity becomes an inadequate measure of sequence 
similarity below the threshold. Alternatively, conver- 
gence in evolution based on physical principles may 
explain the similarity in structure between some of 
the very remotely related pairs. 

(b) A putative sugar-binding domain 

In some cases striking structural resemblances are 
indicative of divergent evolutionary relationships, in 
spite of the apparent lack of statistically significant 
sequence similarity. An example is the similarity 



between Ig-like domains in two apparently unrelated 
enzymes, cyclodextrin glycosyltransferase and galac- 
tose oxidase. Their mutual structural similarity score 
(see Holm & Sander, 1993) is much higher than that 
with the other Ig-like domains of our dataset so that 
they form a distinct subclass of the set of Ig-like 
domains (Figure 2). The structural similarity extends 
over the entire domains, except for one loop insertion 
in galactose oxidase. A total of 70 residues can be 
superimposed with 1.5 Å r.m.s.d. (Figure 5), a 
remarkably good agreement at this low level of 
sequence similarity. A sequence pattern, derived using 
the structural alignment, clearly discriminates the 
relatives of the two domains from the random 
background of unrelated proteins in database 
searches (data not shown). 
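
The two numbers quoted above, 70 equivalent residues and the r.m.s. deviation, can be made concrete with a small sketch. The residue ranges are those listed in the caption of Figure 5, reading the fifth galactose oxidase segment as 613 to 625. The `rmsd` function assumes that the two coordinate sets have already been optimally superimposed; the superposition step itself is not shown. Function and variable names are hypothetical, and the coordinates used to exercise `rmsd` are invented, not taken from 1GOF or 1CGT.

```python
import math


def count_residues(ranges):
    """Number of residues covered by inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points.

    Assumption: both point sets are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Structurally equivalent segments as listed in the Figure 5 caption.
GOF_RANGES = [(543, 566), (569, 574), (589, 596), (600, 606), (613, 625), (627, 638)]
CGT_RANGES = [(497, 520), (526, 535), (539, 542), (546, 552), (555, 575), (577, 580)]

n_gof = count_residues(GOF_RANGES)
n_cgt = count_residues(CGT_RANGES)

# Invented points: every atom displaced by 1.5 along x.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(1.5, 0.0, 0.0), (2.5, 0.0, 0.0)]
toy_rmsd = rmsd(toy_a, toy_b)
```

Both lists of ranges sum to the same number of residues, as required for a one-to-one pairing of Cα atoms.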

The Ig-like domains of cyclodextrin glycosyltrans- 
ferase and galactose oxidase have a long curled sheet, 
and topologically they are h-types. Galactose oxidase 
contains three structurally distinct domains (Ito 
et al., 1991). The N-terminal domain is apparently a 
protein module shared with several bacterial 
sialidases (Bork & Doolittle, 1994). The central 
domain has a seven-blade propeller fold and contains 
the catalytically active site and the metal binding site 
although some residues of the C-terminal Ig-like 
domain contribute to metal binding (Ito et al., 1991). 
Cyclodextrin glycosyltransferase also contains dis- 
tinct structural units. Only domain D has an Ig-like 
topology. The two calcium binding sites and the active 
center are located in other regions of the enzyme 
(Klein & Schulz, 1991). Since many modules of 
glycosyltransferases are known to bind carbo- 
hydrates (Gilkes et al., 1991) and galactose oxidase 
contains at least one mobile module (Bork & Doolittle, 
1994) we speculate that the Ig-like module common 
to both enzymes might also be involved in 
carbohydrate binding. 

(c) Similarity of growth hormone receptor domain 1 
and fibronectin type III repeats 

As already predicted by sequence comparisons 
(Bazan, 1990; Patthy, 1990) and confirmed by X-ray 
crystallography (deVos et al., 1992), the second 
domain of growth hormone receptor is structurally 
very similar to Fn3 repeats. The all-against-all com- 
parison revealed that the first domain, which does not 
contain the typical Fn3 consensus sequence pattern 
(Bork & Doolittle, 1992) but which has three disulfide 
bridges instead, is also structurally more similar to 
classical Fn3 domains than to other Ig-like domains 
(Figure 2). Thus it might be a fast evolving domain of 
the Fn3 type. Interestingly, a tryptophan at the edge 
of strand b is conserved in both families (Figure 4). It 
is tempting to speculate that tryptophan might be a 
relic of common ancestry although other common 
features reduce to some similarity of the hydropho- 
bicity patterns around the β-strands. 



5. Comparisons of Binding Features 

Ig-like domains occur in functionally extremely 
diverse proteins. The proteins used in this study 
(Table 1) represent a rather limited selection yet 
include functionally diverse matrix proteins, recep- 
tors, chaperones and enzymes. They interact with 
extremely different proteins or ligands varying from 
small peptides (e.g. HLA) via hormones (e.g. GHR) to 
giant proteins (e.g. titin oligomer). Even more, the 
determined crystal structures also reveal different 
binding modes; apparently each part of the surface of 
the domain can be used for interaction with other 
molecules (Figure 6). 

One common theme is the interaction with other 
Ig-like domains via the sheets. Whereas in the 
classical Ig variable chains mainly loop regions 
interact with the ligand, the majority of Ig-like 
domains appears to interact via their β-sheets (Figure 
6). Cell surface molecules such as CD4 and CD8 
contact domains of MHC class I and class II 
molecules, CD2 to LFA of other cells. In addition, 
homo- or heterodimers can be formed. Even within 
one protein the different structurally similar domains 
have distinct binding functions. For example, in 
matrix proteins such as fibronectin or tenascin 
various binding activities within their Fn3 regions 
have been reported and the partner molecules range 
from carbohydrates via other Ig-like domains to a 
network of molecules. The interaction may be 
mediated by specific regions (e.g. RGD cell surface 
binding motif in fibronectin) or by parts of the 
β-sheets. Often two consecutive domains are involved 
in binding. Examples are growth hormone receptor 
(deVos et al., 1992), PapD (Holmgren et al., 1992; 
Slonim et al., 1992) and neuroglian (Huber et al., 
1994). The latter binds a metal ion in the cleft between 
two Fn3 domains (Huber et al., 1994); PapD and 
growth hormone receptor bind their major substrate 
in a corresponding region. 

Within each sequence family conserved non-hydro- 
phobic residues might hint at the binding mode of a 
particular family, as has been shown for the PapD-like 
proteins (Holmgren et al., 1992; Slonim et al., 1992; 
see Figure 6). However, the members of most other 
families have distinct binding functions and bind such 
a diverse set of ligands that only structurally 
important residues are conserved. 



6. Common Fold, Different Folding Pathways? 

A comprehensive structural and sequence compari- 
son of currently available Ig-like domains revealed 
that a common topology of fold is achieved by 
fundamentally different sequences. The derived 
sequence consensus lines (Figure 4) do not reveal 
hydrophobicity patterns common to all Ig-like 
structures. In some cases a common pattern exists 
despite a structural divergence (e.g. v-type and 
c-type), in others (e.g. 2HHR domain 1 and Fn3s) a 
relatively close structural relationship is not mirrored 
at the sequence level. 

The comparison of all the structures revealed a 
common structural core composed of two sequence- 
adjacent pairs of strands (b-c, e-f). The segment 
between strands c and e is extremely variable in 
sequence. Structurally, the fringes of the common core 
are extremely variable as shown by variable position, 
length and number of strands that are attached to the 
common core. Given the extreme sequence diversity 
in the set as a whole, no single interaction (or localized 
set of interactions) can be uniquely identified as a 
principal determinant of the Ig-like fold. However, 
the modifications around the common core seen in 
this study might be of interest to protein engineering. 
We propose that the topological subclass could change 
by a single insertion/deletion in the c'-e loop region. 
Disulfide bridges appear to be not essential but might 
stabilize the fold sufficiently to support high 
mutation rates (see tendency in Table 2). Disulfides 
might even mediate drifting of contacts within the 
core on an evolutionary time scale. 

The observed variability opens up intriguing 
questions concerning the folding pathways of this 
class of proteins. Because of the non-linear 
composition of the common core unit, initiation of 
folding from a single β-hairpin (see Richardson, 1981) 
appears to be unlikely. However, collapse of (several) 
locally formed hairpins is somewhat supported since 
most of the strand-strand interactions in the Ig-like 
topologies are between sequence-adjacent strands 
(a-b, f-g, c'-c"). The β-zipper model of Hazes & Hol 
(1992) proposes initial folding of the b-c strands (i.e. 
a sheet-sheet contact), because the b-c loop is usually 
very short in their sample. In Fn3 repeats, however, 
the b-c loop can contain up to 20 residues. In light of 
the present data, different folding pathways for 
different sequence families cannot be excluded. An 
analysis of the Greek key motif supports this view as 
no single folding pathway is likely to fit all Greek key 
structures to which the immunoglobulins belong 
(Hutchinson & Thornton, 1993). 

From the variety of features observed in the 
different Ig-like domains further subtypes and 
modifications of the topology can be expected. 
Indeed, the very recently determined structure of 
cytochrome f revealed a modified Fn3-like domain, 
having additional strands, α-helical elements in the 
loops and a more barrel-like arrangement (Martinez 
et al., 1994). What nature can do, protein engineers 
can mimic. Ig-like domains are a rich source of natural 
and artificial variation on a single structural theme. 

Note added in proof: The very recently determined three-dimensional structure of E. coli β-galactosidase (Jacobson, 
R. H., Zhang, X.-J., DuBose, R. F. & Matthews, B. W. (1994). Nature (London), 369, 761-766) reveals two a-type Ig-like 
domains that flank the catalytic TIM-barrel. 



